Image fetching for timeline scrubbing of digital media

ABSTRACT

The present disclosure describes systems and techniques relating to generating three dimensional (3D) models from range sensor data. According to an aspect, frames of range scan data captured using one or more three dimensional (3D) sensors are obtained, where the frames correspond to different views of an object or scene; point clouds for the frames are registered with each other by maximizing coherence of projected occluding boundaries of the object or scene within the frames using an optimization algorithm with a cost function that computes pairwise or global contour correspondences; and the registered point clouds are provided for use in 3D modeling of the object or scene. Further, the cost function, which maximizes contour coherence, can be used with more than two point clouds for more than two frames at a time in a global optimization framework.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application Ser. No. 62/014,494, entitled “THREE-DIMENSIONAL MODELING FROM WIDE BASELINE RANGE SCANS”, filed Jun. 19, 2014, which is hereby incorporated by reference.

STATEMENT AS TO FEDERALLY SPONSORED RESEARCH

This invention was made with government support under Grant No. W911NF-04-D0005, awarded by Army Research Office (ARO). The government has certain rights in the invention.

BACKGROUND

The present disclosure describes systems and techniques relating to generating three dimensional (3D) models from range sensor data.

Registering two or more range scans is a fundamental problem, with application to 3D modeling. While this problem is well addressed by existing techniques such as ICP (Iterative Closest Point) when the views overlap significantly, no satisfactory solution exists for wide baseline registration. For rigid modeling, most prevailing methodologies either suggest using all consecutive frames from a single range sensor in an incremental manner or merge the range scans of multiple pre-calibrated range sensors. For articulated modeling, such as of the human body, the state-of-the-art methods typically require taking multiple frames, ranging from eight to as many as forty, of different viewpoints of the moving subject. All frames are first aligned in pairs, then registered globally, both in an articulated manner. All of the above methods require a substantial amount of overlap among neighboring frames.

SUMMARY

The present disclosure describes systems and techniques relating to generating three dimensional (3D) models from range sensor data, which includes an approach to 3D modeling that leverages contour coherence and allows alignment of two wide baseline range scans with limited overlap. The contour coherence may be maximized by iteratively building robust corresponding pairs on apparent contours and minimizing their distances. The contour coherence may be used under a multi-view rigid registration framework, and this enables the reconstruction of accurate and complete 3D models from as few as four frames. Moreover, the systems and techniques can be further extended to handle articulations. After modeling with a few frames, in case higher accuracy is required, more frames can be readily added in a drift-free manner by a conventional registration method. Experimental results on both synthetic and real data demonstrate the effectiveness and robustness of the contour coherence-based registration approach to wide baseline range scans, and to 3D modeling.

In some implementations, a method performs registration by maximizing the contour coherence, instead of the shape coherence suggested by other registration algorithms. As such, the error is minimized in the 2D image plane, instead of in the 3D coordinate system. This permits the registration of two wide baseline range scans with limited overlap. In some implementations, the method enables performance of global registration even from poorly initialized range scans, thus avoiding the need for pair-wise registration. Further, in some implementations, the method permits complete modeling of both rigid and articulated objects from as few as four range scans.

According to an aspect of the described systems and techniques, frames of range scan data captured using one or more three dimensional (3D) sensors are obtained, where the frames correspond to different views of an object or scene; point clouds for the frames are registered with each other by maximizing coherence of projected occluding boundaries of the object or scene within the frames using an optimization algorithm with a cost function that computes pairwise or global contour correspondences; and the registered point clouds are provided for use in 3D modeling of the object or scene, or for other processes and applications. The registering can include concurrently registering more than two different point clouds corresponding to the frames of range scan data, and the cost function can compute global contour correspondences for projections of the more than two different point clouds across two dimensional image planes of the frames. Thus, the cost function, which maximizes contour coherence, can be used simultaneously with multiple point clouds for multiple frames in a global optimization framework that registers all the point clouds of all the frames with each other at the same time.

According to another aspect of the described systems and techniques, frames of range scan data captured using one or more 3D sensors are obtained, where the frames correspond to different views of an object or scene; visibility pairings are generated for the frames, where a pairing of a first frame with a second frame indicates that (i) at least a portion of the object or scene represented in the second frame is visible in the first frame and (ii) at least a portion of the object or scene represented in the first frame is visible in the second frame; registration between frames in each of the visibility pairings is performed by maximizing contour coherence of the range scan data for the object or scene, including minimizing (i) distance between contour correspondences for the first frame and the second frame of the pairing in a first two dimensional image plane of the first frame and (ii) distance between contour correspondences for the second frame and the first frame of the pairing in a second two dimensional image plane of the second frame; and the visibility pairings for the frames are updated and the registration is repeated, iteratively, until convergence. According to additional aspects, various computer systems, and computer program products, encoded on computer-readable mediums, effect implementation of the algorithms described.

In various implementations, one or more of the following features and advantages can be provided. A modeling system can take only four range scans as input and still accurately register the data from the four range scans together for generation of a full 3D model of the scanned object or scene, hence greatly reducing both the data acquisition and processing complexity. The method can minimize the global error, and as such the modeling result is globally accurate. Moreover, in case higher accuracy is preferred, more frames can be readily added to the existing model in a drift-free manner.

The above and other aspects and embodiments are described in greater detail in the drawings, the description and the claims.

DESCRIPTION OF DRAWINGS

FIGS. 1A-1F show an example of registration of wide baseline range scans.

FIG. 2 is a block diagram showing an example of a system for registering frames of range scan data by maximizing contour coherence.

FIG. 3 is a flowchart showing an example of a process for registering frames of range scan data by maximizing contour coherence.

FIG. 4 is a flowchart showing an example of a process of rigid registration between range scan pairings.

FIG. 5 shows a process pipeline for a Robust Closest Contour (RCC) algorithm.

FIG. 6 shows a process pipeline for a Multi-view Iterative Closest Contour (M-ICC) algorithm.

FIG. 7 shows a process pipeline for a Multi-view Articulated Iterative Closest Contour (MA-ICC) algorithm.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

Registering two or more range scans is a fundamental problem, with application to three dimensional (3D) modeling. While this problem is well addressed by existing techniques such as Iterative Closest Point (ICP) algorithms when the views overlap significantly, no satisfactory solution exists for wide baseline registration. We propose here a novel approach which leverages contour coherence and allows us to align two wide baseline range scans with limited overlap including, potentially, from a poor initialization. We maximize the contour coherence by building robust corresponding pairs on apparent contours and minimizing their distances in an iterative fashion. The contour coherence can be used under a multi-view rigid registration framework, which enables the reconstruction of accurate and complete 3D models from as few as four frames. We further extend this technique to handle articulations. After modeling with a few frames, in case higher accuracy is required, more frames can be readily added in a drift-free manner by a conventional registration method. Experimental results on both synthetic and real data demonstrate the effectiveness and robustness of this contour coherence based registration approach to wide baseline range scans, and to 3D modeling.

FIGS. 1A-1F show an example of registration of wide baseline range scans. Registering two wide baseline range scans presents a challenging task where two range scans barely overlap and the shape coherence no longer prevails. FIG. 1A shows an example 100 of two wide baseline range scans of the Stanford bunny with approximately 40% overlap. The traditional shape coherence based methods may fail for this example 100 as most closest-distance correspondences are incorrect.

In computer vision dealing with intensity images, a large body of work has been devoted to study the apparent contour, or simply contour. An apparent contour is the projection of a contour generator, which is defined as the set of points on the surface where the tangent plane contains the line of sight from the camera. This contour has been shown to be a rich source of geometric information for motion estimation and 3D reconstruction. Inspired by this work, we propose the concept of contour coherence for wide baseline range scan registration. Contour coherence is defined as the agreement between the observed apparent contour and the predicted apparent contour.

The example 100 in FIG. 1A shows a first meshed point cloud (in grey) generated from the range scan data from a first camera 110 (e.g., a 2.5D range scan from a PRIMESENSE™ camera or other device with a depth sensor) roughly aligned with a second meshed point cloud (in black) from the range scan data from a second camera 120 (e.g., a 2.5D range scan from a PRIMESENSE™ camera or other device with a depth sensor). Note that the description here and below refers to first and second cameras 110 and 120, but as will be appreciated, these may be the same device taking range scans of the object at two different times, between which times the object has moved (e.g., a person turning in front of a single depth image camera to generate frames of range scan data) or the camera has moved (e.g., the camera moves in an arc or circle around a person to generate frames of range scan data). Thus, the references to camera pose herein refer to the orientation of the camera with respect to the object, regardless of whether the object moves, the camera moves, there is more than one camera around the object, or a combination of these.

In FIG. 1A, the observed and predicted apparent contours are shown in both a first image plane 115 for camera-1 110 and a second image plane 125 for camera-2 120. FIG. 1B shows an expanded and straight-on view of the first image plane 115 for camera-1. In FIG. 1B, the observed contours 112 extracted from the range scan data captured by camera-1 do not match the predicted contours 114 extracted from the projected range scan data captured by camera-2. Likewise, FIG. 1C shows an expanded and straight-on view of the second image plane 125 for camera-2, in which the observed contours 118 extracted from the range scan data captured by camera-2 do not match the predicted contours 116 extracted from the projected range scan data captured by camera-1.

As described herein, the contour coherence (between the observed apparent contours 112 and the predicted apparent contours 114 in the first image plane 115, and between the observed apparent contours 118 and the predicted apparent contours 116 in the second image plane 125) can be maximized by iteratively building robust correspondences among apparent contours and minimizing their distances. FIG. 1D shows the registration result 130 with the contour coherence maximized and two wide baseline range scans well aligned. FIG. 1E shows the observed contours 132 extracted from the range scan data captured by camera-1 accurately aligned with the predicted contours 134 extracted from the projected range scan data captured by camera-2 in the first image plane 115 for camera-1. Likewise, FIG. 1F shows the observed contours 138 extracted from the range scan data captured by camera-2 accurately aligned with the predicted contours 136 extracted from the projected range scan data captured by camera-1. Note that the contour coherence is robust in the presence of wide baseline in the sense that only the shape area close to the predicted contour generator is considered when building correspondences on the contour, thus avoiding the search for correspondences over the entire shape.

FIG. 2 is a block diagram showing an example of a system for registering frames of range scan data by maximizing contour coherence. One or more computers 200 each include processing electronics, such as one or more processors 210, and one or more memory devices 220 for holding computer instructions and data. Additional devices can also be included such as a long term storage device (e.g., a hard disk drive), a communication interface device (e.g., a wireless transceiver), and one or more user interface devices 230 (e.g., a display, a camera, a speaker, a microphone, a tactile feedback device, a keyboard, and a mouse). As used herein, a “computer-readable medium” refers to any physical, non-transitory medium (e.g., a magnetic medium in the hard disk drive or a semiconductor medium in a memory device 220) on which a computer program product can be encoded to cause data processing apparatus (e.g., a computer 200) to perform the operations described herein.

In some implementations, a 3D range sensor 240 is included in the system, and the range sensor data is actively captured for processing. For example, in a body scan implementation, the object that is scanned can be a person 260. Moreover, as will be appreciated, various levels of integration of components can be used in various implementations. For example, a sensor 240 can be integrated into a camera 230, which in turn can be integrated into the computer 200, or alternatively be separate therefrom. Moreover, the user interface device(s) 230 can be entirely separate from the computer(s) 200 that do the processing of range scan data to maximize contour coherence. For example, the 3D sensor 240 can be a PRIMESENSE™ Kinect camera (available from Microsoft Corporation of Redmond, Wash.) attached to a home computer system 230 (e.g., a gaming and/or web access machine) that communicates with a server computer 200 through a network 250 (e.g., the Internet). Thus, in some implementations, the range sensor data is not actively captured, but rather is received from a third party source, processed, and then delivered to the desired recipient of the 3D model.

Low-cost structured light 3D sensors, such as found in the Kinect camera, can enable handy complete scanning of objects in the home environment. However, in such environments, it can be difficult to generate range scan data with small motion between consecutive views, which may be required by traditional methods for generating 3D reconstruction results of rigid or articulated objects. Using the systems and techniques described herein, less care need be taken to ensure the user moves the 3D sensor carefully around the object or turns the subject slowly in a controlled manner in order to achieve good 3D reconstruction results. Moreover, drifting error can be addressed, and the gaps visible with traditional techniques can be closed.

FIG. 3 is a flowchart showing an example of a process for registering frames of range scan data by maximizing contour coherence. Frames of range scan data are obtained 300. This can involve actually capturing the range scan data, using camera systems such as described above, or this can involve receiving range scan data that was previously captured by another. In any case, the range scan data is data that was captured using one or more 3D sensors (e.g., a Kinect camera), and the frames of the range scan correspond to different views of an object, regardless of whether taken by one camera or more than one camera.

Initial visibility pairings can be generated 310 for the frames, where a pairing of a first frame with a second frame indicates that at least a portion of the object represented in the second frame is visible in the first frame, and at least a portion of the object represented in the first frame is visible in the second frame. For example, each set of two frames that have their viewing directions within 120 degrees of each other can be paired. Other angles and approaches for generating the visibility pairings are also possible.

The frames in each of the pairings can then be registered 320 with each other by maximizing contour coherence of the range scan data for the object, including minimizing distance between contour correspondences for the first frame and the second frame of the pairing in a first two dimensional image plane of the first frame, and distance between contour correspondences for the second frame and the first frame of the pairing in a second two dimensional image plane of the second frame. Thus, contour coherence can be employed under a multi-view framework to develop a Multi-view Iterative Closest Contour (M-ICC) algorithm. Maximizing contour coherence, i.e., the agreement between the observed and predicted apparent contours, can be used to perform wide baseline range scan registration, where the M-ICC algorithm can alternate between finding closest contour correspondences and minimizing their distances.

A check can be made 330 for convergence, and until convergence is reached, the process can continue to iteratively update visibility pairings 310 for the frames and repeat the registration 320. A stopping condition to identify convergence 330 for the iterative process can be: (1) the maximum iteration number has been reached (e.g., a maximum of 50 iterations), or (2) the distance per contour correspondence is relatively small (e.g., 0.01 or less), or (3) the decrease in distance per contour correspondence is relatively small (e.g., 0.001 or less). Traditional techniques in gradient descent type optimization can be used. In practice, the algorithm will typically reach convergence within ten iterations. The registration 320 can employ a Robust Closest Contour (RCC) algorithm, which is described in detail below, to establish robust contour correspondences on each pair of range scans, either in a rigid registration process, an articulated registration process, or both; and the cost function can be minimized with the Levenberg-Marquardt algorithm.
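The three example stopping conditions can be sketched as a single predicate. This is only an illustration: the function name `has_converged` and its parameter names are hypothetical, and the default thresholds are simply the example values given above.

```python
def has_converged(iteration, dist_per_corr, prev_dist_per_corr,
                  max_iters=50, dist_tol=0.01, decrease_tol=0.001):
    """Return True when any one of the three example stopping conditions holds.

    iteration          -- number of iterations completed so far
    dist_per_corr      -- current distance per contour correspondence
    prev_dist_per_corr -- value from the previous iteration, or None on the first pass
    """
    # (1) maximum iteration number reached
    if iteration >= max_iters:
        return True
    # (2) distance per contour correspondence is small
    if dist_per_corr <= dist_tol:
        return True
    # (3) decrease in distance per contour correspondence is small
    if prev_dist_per_corr is not None and prev_dist_per_corr - dist_per_corr <= decrease_tol:
        return True
    return False
```

The check is deliberately a logical OR of the conditions, matching the "or" wording of the list.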

FIG. 4 is a flowchart showing an example of a process of rigid registration between range scan pairings. A first meshed point cloud is generated 400 from the range scan data of a first frame of a pairing, and a second meshed point cloud is generated 410 from the range scan data of a second frame of the pairing. While there are more pairings 420 in the initial set of pairings, this process is repeated, but note that the meshed point cloud need only be created once for each frame of the range scan data, and since the initial pairings can overlap (i.e., frame 2 can be paired with both frame 1 and with frame 3) any previously generated meshed point cloud can be used for a frame in any of its pairings. In addition, in some implementations, the meshed point clouds are generated for the frames to be registered with each other before any pairings among the frames are made.

A 2.5D range scan R_(i) of frame i provides depth value R_(i)(u) at each image pixel u=(u,v)^(T)∈ℝ². A single constant camera calibration matrix K can be used, which transforms points from the camera frame to the image plane. We represent V_(i)(u)=K⁻¹R_(i)(u)ũ as the back-projection operator which maps u in frame i to its 3D location, where ũ denotes the homogeneous vector ũ=[u^(T)|1]^(T). Inversely, we denote the projection operator as P(V_(i)(u))=g(KV_(i)(u)) where g represents dehomogenisation.
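As a minimal sketch of these two operators, the following NumPy code back-projects a pixel with a known depth and projects the result back to the image. The function names and the intrinsic values in K are hypothetical and chosen only for illustration; they are not taken from any particular sensor.

```python
import numpy as np

def back_project(u, depth, K):
    """V_i(u) = K^-1 * R_i(u) * u~ : map pixel u with depth R_i(u) to a 3D camera-frame point."""
    u_h = np.array([u[0], u[1], 1.0])           # homogeneous pixel u~ = [u^T | 1]^T
    return np.linalg.inv(K) @ (depth * u_h)

def project(X, K):
    """P(X) = g(K X): map a 3D camera-frame point to a pixel; g is dehomogenisation."""
    x = K @ X
    return x[:2] / x[2]

# Hypothetical calibration matrix (focal lengths and principal point are illustrative).
K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])

u = np.array([100.0, 200.0])
X = back_project(u, 1500.0, K)   # depth given in the same units as the range scan
u_again = project(X, K)          # round trip recovers the original pixel
```

With a calibration matrix of this form, the third coordinate of the back-projected point equals the depth value, and projecting the point returns the pixel it came from.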

A meshed point cloud P_(i) is generated for each frame i considering the connectivity on the range scan R_(i). We calculate the normalized 3D normal at each pixel N_(i)(u)∈ℝ³, e.g., following the techniques described in R. A. Newcombe, et al. “Kinectfusion: Real-time dense surface mapping and tracking.” In ISMAR, pages 127-136. IEEE, 2011. N_(i)(u) is further projected back to the image to obtain normalized 2D normal n_(i)(u) of each image pixel.

Registration can be performed 430 between range scans for each pair of frames of range scan data. Thus, for a given pair (with first and second frames) the second meshed point cloud can be projected 440 to the first two dimensional image plane in accordance with camera pose information (e.g., a separate camera pose for each camera) to generate first predicted range scan data for the first frame, and the first meshed point cloud can be projected 440 to the second two dimensional image plane in accordance with the camera pose information to generate second predicted range scan data for the second frame. In general, projecting P_(j) to the ith image, given current camera poses, leads us to a projected range scan R_(j→i). The inputs to the RCC algorithm are thus observed and predicted range scans, namely R_(i) and R_(j→i), and the output is robust contour correspondences M_(i,j→i) (see equation 5 and equation 9 below).

FIG. 5 shows a process pipeline 500 for the RCC algorithm with these inputs and output. The RCC pipeline 500 includes extracting contour points 510, pruning contour points 520, and performing bijective closest matching 530. Referring again to FIG. 4, a first set of contour points can be extracted 450 from the captured range scan data for the first frame and from the first predicted range scan data for the first frame, and a second set of contour points can be extracted 450 from the captured range scan data for the second frame and from the second predicted range scan data for the second frame. Given pixels belonging to the object in frame i as U_(i), we set R_(i)(u)=∞ for u∉U_(i). The contour points C_(i) are extracted considering the depth discontinuity of a pixel and its 8-neighboring pixels,

C_(i)={u∈U_(i) | ∃v∈N_(u)⁸, R_(i)(v)−R_(i)(u)>ζ},  (1)

where ζ is the depth threshold, which can be set as 50 mm. We can also extract a set of occlusion points,

O_(i)={u∈U_(i) | ∃v∈N_(u)⁸, R_(i)(u)−R_(i)(v)>ζ},  (2)

which are boundary points of surface holes created by occlusion, and the depth threshold ζ can again be set as 50 mm.
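Equations (1) and (2) can be sketched directly on a depth image. The code below is an illustrative NumPy version under stated assumptions: the function name is hypothetical, depth is in millimetres, the object mask U_(i) is supplied as a boolean array, and pixels outside the image are treated like background (depth ∞).

```python
import numpy as np

def contour_and_occlusion_points(R, mask, zeta=50.0):
    """Return boolean maps of contour points C_i (eq. 1) and occlusion points O_i (eq. 2).

    R    -- (H, W) depth image in millimetres
    mask -- (H, W) boolean array, True for pixels in U_i (the object)
    zeta -- depth threshold in millimetres
    """
    depth = np.where(mask, R.astype(float), np.inf)    # R_i(u) = inf for u not in U_i
    H, W = depth.shape
    padded = np.pad(depth, 1, constant_values=np.inf)  # out-of-image pixels act as background
    contour = np.zeros((H, W), dtype=bool)
    occlusion = np.zeros((H, W), dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
            # inf - inf yields nan, which compares False; silence the warning.
            with np.errstate(invalid="ignore"):
                contour |= (neighbor - depth) > zeta     # some neighbor is much farther
                occlusion |= (depth - neighbor) > zeta   # some neighbor is much nearer
    return contour & mask, occlusion & mask
```

On a flat 3×3 object patch surrounded by background, the eight border pixels are contour points and the center is not; raising the center depth by 100 mm makes the center an occlusion point.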

Contour points can be removed 460 from the first set to produce first captured and pruned contour points and first predicted and pruned contour points for the first frame, and contour points can be removed 460 from the second set to produce second captured and pruned contour points and second predicted and pruned contour points for the second frame. Both C_(i) and C_(j→i) should be pruned before the matching stage to avoid possible incorrect correspondences. First, due to the self-occlusion of frame j, C_(j→i) contains false contour points which are actually generated by the meshes in P_(j) connected with C_(j) and O_(j). We mark and remove them to generate the pruned contour points C_(j→i) ^(p). Second, again due to the self-occlusion of frame j, some contour points in C_(i) should not be matched with any contour point in C_(j→i) ^(p), e.g., the contour points in frame 2 belonging to the back part of the object, which are not visible in view 1. Hence we prune C_(i) based on the visibility of the corresponding contour generator in view j,

C_(i/j) ^(p)={u∈C_(i) | N_(i)(u)^(T)·(o_(j→i)−V_(i)(u))>0},  (3)

where o_(j→i) is the camera location of frame j in camera i.
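The visibility test of equation (3) amounts to a sign check on a dot product. A minimal sketch follows, assuming the back-projected contour points V_(i)(u) and their 3D normals N_(i)(u) are already available as arrays; the function and argument names are hypothetical.

```python
import numpy as np

def prune_by_visibility(points_3d, normals_3d, o_j_in_i):
    """Return a boolean mask selecting contour points of frame i that face camera j.

    points_3d  -- (n, 3) back-projected contour points V_i(u) in camera i coordinates
    normals_3d -- (n, 3) unit normals N_i(u) at those points
    o_j_in_i   -- (3,) location of camera j expressed in camera i coordinates
    """
    to_camera_j = o_j_in_i[None, :] - points_3d            # o_{j->i} - V_i(u)
    # Keep a point only if its normal has a positive component toward camera j.
    return np.einsum("ij,ij->i", normals_3d, to_camera_j) > 0

points = np.array([[0.0, 0.0, 1000.0], [0.0, 0.0, 1000.0]])
normals = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
keep = prune_by_visibility(points, normals, np.array([500.0, 0.0, 0.0]))
```

In this example the first point, whose normal points toward camera j, is kept, and the second, whose normal points away, is pruned.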

Bijective closest matching can be performed 470 in three dimensions between the first captured and pruned contour points and the first predicted and pruned contour points for the first frame, and bijective closest matching can be performed 470 in three dimensions between the second captured and pruned contour points and the second predicted and pruned contour points for the second frame. After pruning, a one-way closest matching algorithm between C_(i/j) ^(p) and C_(j→i) ^(p) still fails, as contour points are sensitive to minor changes in viewing directions, e.g., camera 1 observes only one leg of an object while the contour points of two legs of the object are extracted from the projected range scan. Hence a bijective matching scheme can be used (e.g., using the techniques described in M. Zeng, et al. “Templateless quasi-rigid shape modeling with implicit loop-closure.” In CVPR, pages 145-152, 2013.) when establishing robust correspondences (see equation 5 and equation 9 below).

Matching directly in the 2D image space can lead to many wrong corresponding pairs. The ambiguity imposed by the 2D nature can be resolved by relaxing the search to 3D, as we have the 3D point location V_(i)(u) for each contour point. It is worth mentioning that while we build correspondences in 3D, we are minimizing the distances between contour correspondences in 2D, as the real data given by most structured-light 3D sensors is extremely noisy along the rays of apparent contour. Thus, point cloud registration is achieved for two frames using projected occluding boundaries of the range scan data of the two frames.

Note that a great deal of information is lost when working in 2D, and so the correspondences cannot reliably be found in 2D. Finding correspondences in 3D space reduces their ambiguity, but the distances should be minimized in the 2D image plane because the noise in the depth image at the contour points is large. Thus, trying to minimize the distances directly in 3D can cause substantial problems.

FIG. 6 shows a process pipeline 600 for a Multi-view Iterative Closest Contour (M-ICC) algorithm. The M-ICC algorithm can be used to rigidly align all range scans at the same time. Given N roughly initialized range scans, the algorithm can alternate between updating 610 a view graph (visibility list), establishing 620 robust contour correspondences from pairs of range scans in the view graph (e.g., using the RCC algorithm), and minimizing 630 distances between all correspondences until convergence 640. While a traditional pairwise ICP algorithm fails in the presence of wide baseline, the present registration method recovers accurate camera poses.
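The alternation in pipeline 600 can be sketched as a simple loop. This is only a control-flow skeleton under stated assumptions: the function `m_icc` and its four callable arguments are hypothetical placeholders for the view graph update, the correspondence search (e.g., the RCC algorithm), the distance minimization, and the convergence check, none of which are implemented here.

```python
def m_icc(theta, update_view_graph, find_correspondences, minimize_distances,
          converged, max_iters=50):
    """Alternate steps 610, 620 and 630 until the convergence check 640 succeeds.

    theta                -- current camera pose parameters for all frames
    update_view_graph    -- callable: theta -> view graph (visibility list)
    find_correspondences -- callable: (theta, graph) -> contour correspondences
    minimize_distances   -- callable: (theta, matches) -> (new theta, distance per correspondence)
    converged            -- callable: list of distances so far -> bool
    """
    history = []
    for _ in range(max_iters):
        graph = update_view_graph(theta)                   # step 610
        matches = find_correspondences(theta, graph)       # step 620
        theta, dist = minimize_distances(theta, matches)   # step 630
        history.append(dist)
        if converged(history):                             # step 640
            break
    return theta, history

# Toy usage: a scalar "pose" that is halved at every minimization step.
theta, history = m_icc(
    1.0,
    update_view_graph=lambda t: None,
    find_correspondences=lambda t, g: None,
    minimize_distances=lambda t, m: (t / 2, abs(t / 2)),
    converged=lambda h: h[-1] <= 0.01,
)
```

Because the view graph is recomputed from the current poses at the top of every pass, pairings that were not visible under the initial poses can enter the graph as the alignment improves.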

Frame i is associated with a 6 DOF (degrees of freedom) rigid transformation matrix

${}^{w}T_{i} = \begin{bmatrix} R_{i} & t_{i} \\ 0^{T} & 1 \end{bmatrix}$

where R_(i) is parameterized by a 3 DOF quaternion, namely q_(i)=[q_(i) ^(w),q_(i) ^(x),q_(i) ^(y),q_(i) ^(z)] with ∥q∥₂=1, and t_(i) is the translation vector. Operator ^(w)π_(i)(u)=^(w)T_(i)Ṽ_(i)(u) transforms pixel u to its corresponding homogeneous back-projected 3D point in the world coordinate system, where Ṽ_(i)(u) is the homogeneous back-projected 3D point in the camera coordinate system of frame i. Inversely, we have operator ^(i)π_(w) such that u=^(i)π_(w)(^(w)π_(i)(u))=P(g(^(i)T_(w) ^(w)π_(i)(u))). Given N frames, we have a total of 6×N parameters stored in a vector θ.
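For illustration, a camera-to-world matrix of this block form can be assembled from a quaternion and a translation as follows. The sketch assumes the common Hamilton convention with component order [w, x, y, z]; the function name `pose_matrix` is hypothetical.

```python
import numpy as np

def pose_matrix(q, t):
    """Build the 4x4 rigid transform [[R, t], [0^T, 1]] from a quaternion q = [w, x, y, z]."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)  # enforce unit norm
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# A 90 degree rotation about the z axis followed by a translation.
s = np.sqrt(0.5)
T = pose_matrix([s, 0.0, 0.0, s], [10.0, 20.0, 30.0])
p_world = T @ np.array([1.0, 0.0, 0.0, 1.0])   # homogeneous camera-frame point
```

Normalizing the quaternion inside the function keeps the rotation block orthonormal even when an optimizer perturbs the four components independently.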

Unlike other approaches, where pairwise registration is performed before a final global error diffusion step, we do not require pairwise registration and explicitly employ contour coherence under a multi-view framework. We achieve that by associating two camera poses with a single contour correspondence. Assuming u and v are a corresponding pair belonging to frame i and frame j respectively, then their distance is modeled as ∥v−^(j)π_(w)(^(w)π_(i)(u))∥₂. Minimizing this distance updates both camera poses at the same time, which allows us to globally align all frames together. It is worth mentioning that pairwise registration is a special case of our multi-view scenario in a way that the pairwise registration ²T₁ is achieved as ²T₁=²T_(w) ^(w)T₁.

A view graph (or visibility list) L is a set of pairing relationships among all frames. (i,j)∈L indicates that frame j is visible in frame i and hence robust contour correspondences should be established between R_(i) and R_(j→i). Each frame's viewing direction in the world coordinate is R_(i)(0,0,1)^(T) and frame j is viewable in frame i only if their viewing directions are within a certain angle n, i.e.,

L={(i,j) | arccos((0,0,1)R_(i) ^(T)R_(j)(0,0,1)^(T))<n}  (4)

It is worth mentioning that (i,j)≠(j,i), and we establish two pairs of correspondences between frame i and frame j, namely between C_(j→i) ^(p) and C_(i/j) ^(p), and between C_(i→j) ^(p) and C_(j/i) ^(p).
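A minimal sketch of building the view graph of equation (4) from the current rotation matrices is given below. The function names are hypothetical, frames are indexed from 0 rather than 1, and the 120 degree default is the example angle mentioned earlier for the initial pairings.

```python
import numpy as np

def build_view_graph(rotations, max_angle_deg=120.0):
    """Return the set of ordered pairs (i, j) whose viewing directions differ by less than the angle."""
    z = np.array([0.0, 0.0, 1.0])
    directions = [R @ z for R in rotations]   # viewing direction R_i (0,0,1)^T in world coordinates
    pairs = set()
    for i, d_i in enumerate(directions):
        for j, d_j in enumerate(directions):
            if i == j:
                continue
            # Clip guards against rounding slightly outside [-1, 1].
            angle = np.degrees(np.arccos(np.clip(d_i @ d_j, -1.0, 1.0)))
            if angle < max_angle_deg:
                pairs.add((i, j))
    return pairs

def rot_y(deg):
    """Rotation about the y axis, used here only to place example cameras around an object."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Four views spaced 90 degrees apart: neighbors are paired, opposite views are not.
graph = build_view_graph([rot_y(0), rot_y(90), rot_y(180), rot_y(270)])
```

Both (i, j) and (j, i) are stored because correspondences are established separately in each of the two image planes.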

Another issue worth raising is that the loop closure is automatically detected and achieved if all N views form a loop. For example, if L is calculated as {(1; 2); (2; 1); (2; 3); (3; 2); (1; 4); (4; 1)} from θ_(initial) with N=4, i.e., the gap between frame 3 and frame 4 is large and the loop is not closed from the beginning, then as we iterate and update the camera poses, link {(3; 4); (4; 3)} is added to L and we automatically close the loop. Thus, performing the registration process can involve globally aligning all the frames together in each iteration, and automatically detecting and achieving loop closure if a loop among all views is present.

For each viewable pair (i,j)∈L, robust contour correspondences M_(i,j→i) can be extracted between C_(j→i) ^(p) and C_(i/j) ^(p) using the RCC algorithm as

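$\begin{matrix} {M_{i,{j\rightarrow i}} = \left\{ \left( u,\ {}^{j}\pi_{w}\left( {}^{w}\pi_{i}(v) \right) \right) \;\middle|\; v = \underset{m \in C_{j\rightarrow i}^{p}}{\arg\min}\ d\left( V_{i}(u), V_{j\rightarrow i}(m) \right),\ u = \underset{n \in C_{i/j}^{p}}{\arg\min}\ d\left( V_{j\rightarrow i}(v), V_{i}(n) \right) \right\}} & (5) \end{matrix}$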

where d(x,y)=∥x−y∥₂ is the Euclidean distance operator. Pixel v is the closest (i.e., distance in the back-projected 3D space) point on the pruned predicted contour to pixel u on the pruned observed contour, while at the same time pixel u is also the closest to pixel v, i.e., the bijectivity in 3D is imposed.
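The bijectivity condition can be sketched as a mutual nearest neighbor test on the back-projected 3D contour points. This is a brute-force illustration with hypothetical names; it computes all pairwise distances and is not meant to reflect how an efficient implementation would search.

```python
import numpy as np

def bijective_closest_matches(obs_pts, pred_pts):
    """Return index pairs (u, v) that are each other's closest point in 3D.

    obs_pts  -- (n, 3) back-projected pruned observed contour points
    pred_pts -- (m, 3) back-projected pruned predicted contour points
    """
    if len(obs_pts) == 0 or len(pred_pts) == 0:
        return []
    # Pairwise Euclidean distances d(x, y) = ||x - y||_2, shape (n, m).
    d = np.linalg.norm(obs_pts[:, None, :] - pred_pts[None, :, :], axis=2)
    nearest_pred = d.argmin(axis=1)   # for each observed point, closest predicted point
    nearest_obs = d.argmin(axis=0)    # for each predicted point, closest observed point
    return [(u, int(v)) for u, v in enumerate(nearest_pred) if nearest_obs[v] == u]

# One observed "leg" versus two predicted "legs": only the mutually closest pair survives.
obs = np.array([[0.0, 0.0, 0.0]])
pred = np.array([[1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
matches = bijective_closest_matches(obs, pred)
```

A one-way search from the predicted side would pair both predicted points with the single observed point, whereas the mutual test leaves the second predicted point unmatched.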

The cost to be minimized can be the sum of point-to-plane distances of all contour correspondences as

$\begin{matrix} {\varepsilon_{R} = {\sum\limits_{{({i,j})} \in L}{\sum\limits_{{({u,v})} \in M_{i,{j\rightarrow i}}}{\left| {\left( {u - {{}^{i}\pi_{w}\left( {{}^{w}\pi_{j}(v)} \right)}} \right)^{T} \cdot {n_{i}(u)}} \right|.}}}} & (6) \end{matrix}$

In practice, we find that the point-to-plane error metric allows two contours to slide along each other and reach a better local optimum than the point-to-point error metric.
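The difference between the two error metrics can be seen in a small 2D sketch. The code assumes the predicted contour pixels have already been reprojected into the image plane of frame i, and all names are hypothetical.

```python
import numpy as np

def point_to_plane_cost(u, v_proj, normals_2d):
    """Sum over correspondences of |(u - v')^T n_i(u)|, with v' the reprojected predicted pixel.

    u          -- (n, 2) observed contour pixels in image i
    v_proj     -- (n, 2) matched predicted contour pixels reprojected into image i
    normals_2d -- (n, 2) unit 2D normals n_i(u) at the observed pixels
    """
    residuals = np.einsum("ij,ij->i", u - v_proj, normals_2d)
    return np.abs(residuals).sum()

def point_to_point_cost(u, v_proj):
    """Sum of Euclidean pixel distances, shown only for comparison."""
    return np.linalg.norm(u - v_proj, axis=1).sum()

u = np.array([[0.0, 0.0]])
n = np.array([[1.0, 0.0]])            # contour normal along the image x axis
v_slid = np.array([[0.0, 4.0]])       # match displaced along the contour tangent
v_off = np.array([[3.0, 4.0]])        # match displaced across and along the contour
```

A displacement purely along the contour tangent costs nothing under the point-to-plane metric but four pixels under the point-to-point metric, which is the sliding behavior described above.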

To handle articulations, the M-ICC algorithm can be extended to a Multi-view Articulated Iterative Closest Contour (MA-ICC) algorithm. For example, the M-ICC algorithm 600 can be applied to the frames of range scan data as initialization to roughly align the frames of range scan data, and then a further process of registration can be performed using segmentation and articulation.

Segmentation information can be obtained for range scan data in at least one frame, where the segmentation information locates all rigid parts in the at least one frame. This can involve receiving a known segmentation for a known frame, such as knowing that the first frame will include a person facing toward the camera with arms and legs spread. Alternatively, this can involve identifying one of the frames that most closely matches a segmentation template for the object scanned (e.g., a human body template composed of a head, a torso, two arms, a waist, two upper leg portions, and two lower leg portions), and then creating the segmentation information based on this template. Various segmentation algorithms can be used in various implementations, such as described in Shotton, Jamie, et al. “Real-time human pose recognition in parts from single depth images.” Communications of the ACM 56.1 (2013): 116-124.

Range scan data in additional frames can be segmented in accordance with current alignments among the frames and the segmentation information for the range scan data in the at least one frame. Segmented visibility pairings can be generated for the segmented frames, where a segmented pairing of a first segmented frame with a second segmented frame indicates that (i) one of the rigid parts located in the second frame is visible in the first frame and (ii) one of the rigid parts located in the first frame is visible in the second frame. Segmented registration between each of the segmented visibility pairings can be performed by maximizing contour coherence of the range scan data for one or more rigid parts in each segmented pairing, including minimizing (i) distance between contour correspondences for the first segmented frame and the second segmented frame in a first two dimensional image plane of the first segmented frame and (ii) distance between contour correspondences for the second segmented frame and the first segmented frame in a second two dimensional image plane of the second segmented frame. Finally, the process of segmenting the range scan data in the additional frames, updating the segmented visibility pairings for the frames and performing the segmented registration can be iteratively repeated until convergence.

FIG. 7 shows a process pipeline 700 for a Multi-view Articulated Iterative Closest Contour (MA-ICC) algorithm. Given N range scans, articulation structure as well as known segmentation W₁ of all rigid parts in the first frame, all range scans can initially be regarded as rigid, and the M-ICC algorithm 600 can be applied to roughly align all the range scans. Then the MA-ICC algorithm iteratively segments 710 other frames, updates 720 the view graph, establishes 730 robust contour correspondences (e.g., using RCC) and minimizes 740 until convergence.

A standard hierarchical structure can be employed for the segmentation, where each rigid segment k of frame i has an attached local coordinate system related to the world coordinate system via transform ^(w)T_(k) ^(i). This transformation is defined hierarchically by the recurrence ^(w)T_(k) ^(i)=^(w)T_(k_p) ^(i) ^(k_p)T_(k) ^(i), where k_(p) is the parent node of k. For the root node, we have ^(w)T_(root) ^(i)=^(w)T_(i), where ^(w)T_(i) can be regarded as the camera pose of frame i. The transform ^(k_p)T_(k) ^(i) has a parameterized rotation component and a translation component completely dependent on the rotation component. As such, for a total of N range scans where the complete articulated structure contains M rigid segments, there are a total of N×(M×3+3) parameters stored in the vector θ.
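The recurrence and the parameter count can be sketched as follows. This is an illustration only: the segment names, the parent table, and the use of pure translations as stand-ins for the local transforms are assumptions made to keep the example short, and the helper names are hypothetical.

```python
def matmul(A, B):
    """Product of two 4x4 homogeneous transforms stored as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def translation(x, y, z):
    """Stand-in local transform; a real one would be driven by a rotation."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]

def world_transform(k, parent, local, camera_pose):
    """wT_k = wT_{k_p} * kpT_k, with wT_root equal to the camera pose."""
    if parent[k] is None:  # root node
        return camera_pose
    return matmul(world_transform(parent[k], parent, local, camera_pose), local[k])

def parameter_count(n_scans, m_segments):
    """N x (M x 3 + 3), as stated in the text."""
    return n_scans * (m_segments * 3 + 3)

# Hypothetical three-segment chain for one frame.
parent = {"torso": None, "upper_arm": "torso", "lower_arm": "upper_arm"}
local = {"upper_arm": translation(0.2, 0.0, 0.0),
         "lower_arm": translation(0.3, 0.0, 0.0)}
camera_pose = translation(0.0, 0.0, 2.0)

T_lower = world_transform("lower_arm", parent, local, camera_pose)
n_params = parameter_count(4, 9)
```

For four scans and nine rigid segments, the formula gives 4×(9×3+3)=120 parameters.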

A Linear Blend Skinning (LBS) scheme can be employed, where each pixel u in frame i is given a weight vector W_(i)(u)∈ℝ^(M) with

${\sum\limits_{j = 1 \ldots M} W_{i}(u)_{j}} = 1,$

indicating its support from all rigid segments. As such, operator ^(w)π_(i) in the rigid case is rewritten as

${{}^{w}\pi_{i}^{A}(u)} = \sum\limits_{j = 1 \ldots M} {}^{w}T_{j}^{i}\, {\overset{\sim}{V}}_{i}(u)\, W_{i}(u)_{j}$

in the articulated case, which is a weighted transformation of all rigid segments attached to u. Similarly we have operator ^(i)π_(w) ^(A) as the inverse process such that u=^(i)π_(w) ^(A)(^(w)Ṽ_(i)(u)).
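A minimal sketch of the blending step is given below, assuming each segment transform is a 4×4 homogeneous matrix and the pixel has already been back-projected to a 3D point. The names `transform_point` and `blend`, and the example transforms and weights, are hypothetical.

```python
def transform_point(T, p):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))

def blend(transforms, weights, p):
    """Weighted sum over segments of the per-segment transformed point."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("skinning weights must sum to 1")
    out = [0.0, 0.0, 0.0]
    for T, w in zip(transforms, weights):
        q = transform_point(T, p)
        for a in range(3):
            out[a] += w * q[a]
    return tuple(out)

identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
shift_x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

# A point supported equally by two segments lands halfway between their motions.
blended = blend([identity, shift_x], [0.5, 0.5], (0.0, 0.0, 1.0))
```

A point that belongs exclusively to one segment (weight 1 for that segment) simply follows that segment's transform.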

Given the segmentation W₁ of the first frame and predicted pose θ, pixel u∈U_(i) of frame i can be segmented as

$\begin{matrix} {W_{i}(u) = W_{1}\left( \underset{v \in U_{1}}{\arg\min}\; d\left( v,\; {}^{1}\pi_{w}^{A}\left( {}^{w}\pi_{i}^{A}(u) \right) \right) \right),} & (7) \end{matrix}$

i.e., the same weight as the closest pixel in the first frame. Further, to simplify the following discussion, we define F(S,k)={u∈S | W_(S)(u)_(k)=1}, which indicates the subset of S with pixels exclusively belonging to the k-th rigid part.
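The weight transfer and the subset F(S,k) can be sketched as follows. The sketch assumes the pixels of frame i have already been mapped into the image plane of the first frame, so only the nearest-pixel lookup remains; the names `transfer_weights` and `exclusive_subset` and the sample coordinates are hypothetical.

```python
import math

def transfer_weights(warped_pixels, pixels_1, weights_1):
    """Give each warped pixel the weight vector of its closest frame-1 pixel."""
    result = []
    for u in warped_pixels:
        closest = min(range(len(pixels_1)), key=lambda k: math.dist(u, pixels_1[k]))
        result.append(weights_1[closest])
    return result

def exclusive_subset(pixels, weights, k):
    """F(S, k): pixels whose weight for segment k equals 1."""
    return [u for u, w in zip(pixels, weights) if w[k] == 1.0]

# Frame 1: two pixels, each belonging exclusively to one of two segments.
pixels_1 = [(10.0, 10.0), (50.0, 10.0)]
weights_1 = [(1.0, 0.0), (0.0, 1.0)]

# Pixels of another frame, assumed already mapped into frame 1.
warped = [(12.0, 11.0), (48.0, 9.0), (20.0, 10.0)]
weights_i = transfer_weights(warped, pixels_1, weights_1)
segment_0 = exclusive_subset(warped, weights_i, 0)
```

A brute-force search is used here for clarity; the choice of search structure is left open.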

In the presence of articulation, contour correspondences need only be built on the corresponding rigid body parts, and as such, (i,j,k)∈L^(A) indicates that rigid segment k of frame j is viewable in frame i, and robust contour correspondences should be built between F(C_(i/j) ^(p),k) and F(C_(j→i) ^(p),k). Besides considering the viewing direction of cameras, self-occlusion can be considered and contour correspondences can be built only when there are enough contour points (i.e., more than γ) belonging to the rigid segment k in both views,

$\begin{matrix} {L^{A} = \left\{ (i,j,k) \;\middle|\; \arccos\left( (0,0,1)\, R_{i}^{T} R_{j}\, (0,0,1)^{T} \right) < \eta,\; \#\left( F\left( C_{i/j}^{p},k \right) \right) > \gamma,\; \#\left( F\left( C_{j\rightarrow i}^{p},k \right) \right) > \gamma \right\}.} & (8) \end{matrix}$

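In addition, for each viewable pair (i,j,k)∈L^(A), the set of bijective contour correspondences M_(i,j→i,k) ^(A) between F(C_(i/j) ^(p),k) and F(C_(j→i) ^(p),k) can be extracted by RCC as

$\begin{matrix} {M_{i,j\rightarrow i,k}^{A} = \left\{ \left( u,\; {}^{j}\pi_{w}^{A}\left( {}^{w}\pi_{i}^{A}(v) \right) \right) \;\middle|\; v = \underset{m \in F(C_{j\rightarrow i}^{p},k)}{\arg\min}\; d\left( V_{i}(u), V_{j\rightarrow i}(m) \right),\; u = \underset{n \in F(C_{i/j}^{p},k)}{\arg\min}\; d\left( V_{i}(n), V_{j\rightarrow i}(v) \right) \right\},} & (9) \end{matrix}$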

and the sum of point-to-plane distances can be minimized between all contour correspondences by

$\begin{matrix} {\varepsilon_{R} = \sum\limits_{(i,j,k) \in L^{A}} \sum\limits_{(u,v) \in M_{i,j\rightarrow i,k}^{A}} \left| \left( u - {}^{i}\pi_{w}^{A}\left( {}^{w}\pi_{j}^{A}(v) \right) \right)^{T} \cdot n_{i}(u) \right| + \alpha\, \theta^{T} \cdot \theta,} & (10) \end{matrix}$

where αθ^(T)·θ is used as the regularization term favoring the small articulation assumption (i.e., the articulations introduced by subject's motion while the subject is trying to keep the same global pose during motion). Note that the magnitude of the additional term linearly increases as the angle of articulation increases.
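How the regularizer enters the cost can be shown with a small numeric sketch. The residual values and articulation parameters below are made up for illustration, their units are not specified here, and the function name `articulated_cost` is hypothetical.

```python
def articulated_cost(residuals, theta, alpha):
    """Sum of absolute point-to-plane residuals plus alpha * theta^T theta."""
    data_term = sum(abs(r) for r in residuals)
    regularizer = alpha * sum(t * t for t in theta)
    return data_term + regularizer

# Hypothetical signed point-to-plane residuals over all correspondences.
residuals = [0.5, -0.25]
# Hypothetical articulation parameters.
theta = [0.1, 0.0, -0.1]

cost = articulated_cost(residuals, theta, alpha=100.0)
cost_no_articulation = articulated_cost(residuals, [0.0, 0.0, 0.0], alpha=100.0)
```

With these numbers the data term is 0.75 and the regularizer contributes 100×0.02=2, so a pose with no articulation is preferred unless articulation reduces the contour residuals by more than that amount.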

As before, the stopping condition for the iterative process can be: (1) the maximum iteration number has been reached, or (2) the distance per contour correspondence is relatively small, or (3) the decrease in distance per contour correspondence is relatively small. For each iteration, equation 6 and equation 10 are non-linear in the parameters. As such, the Levenberg-Marquardt algorithm can be employed as the solver. In some implementations, the Jacobian matrix for the Levenberg-Marquardt algorithm is calculated by the chain rule.
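The three stopping conditions can be sketched as a simple check inside the outer loop. The numeric thresholds below are placeholders chosen for illustration, since no specific values are given for "relatively small"; the names `should_stop` and `run` are hypothetical, and the solver step itself is replaced by a precomputed sequence of distances.

```python
def should_stop(iteration, dist_now, dist_prev,
                max_iters=10, dist_tol=1.0, decrease_tol=0.01):
    """Return the reason to stop, or None to keep iterating."""
    if iteration >= max_iters:
        return "max_iterations"
    if dist_now < dist_tol:
        return "small_distance"
    if dist_prev is not None and dist_prev - dist_now < decrease_tol:
        return "small_decrease"
    return None

def run(distances, **kwargs):
    """Walk a sequence of per-correspondence distances and report where iteration stops."""
    prev = None
    for it, d in enumerate(distances, start=1):
        reason = should_stop(it, d, prev, **kwargs)
        if reason:
            return it, reason
        prev = d
    return len(distances), "sequence_exhausted"

stalled = run([8.0, 4.0, 2.5, 2.495])
converged = run([8.0, 0.5])
capped = run([9.0, 8.0, 7.0, 6.0], max_iters=3)
```

In an actual registration loop, each distance would come from re-extracting correspondences and running one solver step before the check.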

The depth discontinuity threshold can be set as ζ=50 mm in equations 1 and 2. The viewable angle threshold can be set as η=120° in equations 4 and 8, while the rigid segment minimum number of points threshold can be set as γ=500 in equation 8. The weight for the regularizer can be set as α=100 in equation 10. However, it is worth mentioning that for most if not all range scanning devices (e.g., the Kinect sensor) the parameters work for a large range and do not require specific tuning.
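For reference, these example settings can be collected in one place. The container below is only a sketch; the class and field names are hypothetical, and only the numeric values come from the text.

```python
from typing import NamedTuple

class RegistrationParams(NamedTuple):
    depth_discontinuity_mm: float = 50.0   # zeta, equations 1 and 2
    viewable_angle_deg: float = 120.0      # eta, equations 4 and 8
    min_segment_points: int = 500          # gamma, equation 8
    regularizer_weight: float = 100.0      # alpha, equation 10

defaults = RegistrationParams()
# Individual values can be overridden without touching the others.
looser_view_test = defaults._replace(viewable_angle_deg=135.0)
```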

In practice, the M-ICC and MA-ICC algorithms can converge within ten iterations. Specifically for rigid registration, within each iteration and for each pairing, two projections are performed, two sets of robust contour correspondences are extracted, and the cost function can be minimized with the Levenberg-Marquardt algorithm. Since the projection can be readily parallelized and the closest-point matching is searching over a limited number of contour points in 2D space, the algorithms described herein can readily run in real time on a GPU (Graphics Processing Unit).

When modeling rigid objects, four depth images of the object can be captured approximately 90° apart using a single Kinect sensor. The background depth pixels can be removed from the range scans by simply thresholding depth and then detecting and removing planar pixels using RANSAC (see M. A. Fischler and R. C. Bolles. “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.” Communications of the ACM, 24(6):381-395, 1981.), assuming the object is the only thing on the ground or table. The four wide baseline range scans can be initialized by aligning the centers together and assuming a pairwise 90° rotation angle. The four frames can be registered using the M-ICC algorithm, and the Poisson Surface Reconstruction (PSR) algorithm (see M. Kazhdan, et al. “Poisson surface reconstruction.” In Proceedings of the fourth Eurographics symposium on Geometry processing, 2006.) can be used to generate the final water-tight model.
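The initialization step (aligned centers and an assumed 90° rotation between consecutive views) can be sketched as follows. The sketch assumes the vertical axis is y and picks one rotation direction arbitrarily; the helper names and the sample points are hypothetical.

```python
import math

def centroid(points):
    """Mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[a] for p in points) / n for a in range(3))

def initial_pose(points, view_index, step_deg=90.0):
    """Rotate by view_index * step about the vertical axis and centre the scan at the origin."""
    a = math.radians(view_index * step_deg)
    c, s = math.cos(a), math.sin(a)
    R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    cx, cy, cz = centroid(points)
    # Translation chosen so that R * centroid + t = 0, i.e. all scan centres coincide.
    t = tuple(-(R[r][0] * cx + R[r][1] * cy + R[r][2] * cz) for r in range(3))
    return R, t

def apply(R, t, p):
    """Map a point through the rigid transform (R, t)."""
    return tuple(R[r][0] * p[0] + R[r][1] * p[1] + R[r][2] * p[2] + t[r] for r in range(3))

# Four hypothetical scans, each a handful of 3D points in its own camera frame.
scans = [[(0.0, 0.0, 1.0), (0.2, 0.0, 1.2), (-0.2, 0.0, 0.8)] for _ in range(4)]
poses = [initial_pose(scan, k) for k, scan in enumerate(scans)]
centres = [apply(R, t, centroid(scan)) for (R, t), scan in zip(poses, scans)]
```

Every scan centre maps to the origin, and the views differ only by the assumed 90° steps, which gives the registration a rough starting point.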

In some implementations, the object can be placed on a turntable and rotated to generate the frames of range scan data. In some implementations, smooth 3D object models can be reconstructed from the scans using only four views of the real-world object. In some implementations, the human body is scanned by turning the subject in front of a fixed sensor while showing four key poses, i.e., front, left, back and right, in that order. In implementations that use a sensor with a limited field of view, such as the Kinect sensor, rather than have the subject stand far away from the sensor, which results in degradation of the input data quality, the subject can be asked to come closer to the sensor and stay rigid for five to ten seconds at each key pose while the sensor, controlled by a built-in motor, swipes up and down to scan the subject (e.g., using KinectFusion). The reconstructed partial 3D scene can be further projected back to generate a super-resolution range scan.

After acquiring four super-resolution range scans of the subject, these can be aligned using the MA-ICC algorithm. Segmentation of the first range scan can be performed by heuristically segmenting the bounding contour and then assigning to each pixel the same weight as its closest bounding contour point. In some implementations, the whole body can be segmented into nine parts (e.g., a head, a torso, two arms, a waist, two upper leg portions, and two lower leg portions). In other implementations, other segmentation algorithms can be applied as input, such as described in Shotton, Jamie, et al. “Real-time human pose recognition in parts from single depth images.” Communications of the ACM 56.1 (2013): 116-124. After registration, PSR can be used again to generate the final watertight model. Thus, accurate and complete human body models can be generated from as few as four views with a single sensor.

The processes described above, and all of the functional operations described in this specification, can be implemented in electronic circuitry, or in computer hardware, firmware, software, or in combinations of them, such as the structural means disclosed in this specification and structural equivalents thereof, including potentially a program (stored in a machine-readable medium) operable to cause one or more programmable machines including processor(s) (e.g., a computer) to perform the operations described. It will be appreciated that the order of operations presented is shown only for the purpose of clarity in this description. No particular order may be required for these operations to achieve desirable results, and various operations can occur simultaneously or at least concurrently. In certain implementations, multitasking and parallel processing may be preferable.

The various implementations described above have been presented by way of example only, and not limitation. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Thus, the principles, elements and features described may be employed in varied and numerous implementations, and various modifications may be made to the described embodiments without departing from the spirit and scope of the invention. Accordingly, other embodiments may be within the scope of the following claims. 

1. A method comprising: receiving, by a digital media receiver at a first time prior to receiving a first scrub command, a first set of scrub images associated with digital media, the first set of scrub images having a first set of positions on a timeline of the digital media; receiving, by the digital media receiver, the first scrub command; receiving, by the digital media receiver at a second time after the first time, a second set of scrub images associated with the digital media, the second set of scrub images having a second set of positions on the timeline that fill time gaps in the first set of positions on the timeline; responsive to the first scrub command: animating, by the digital media receiver, a timeline overlay including the timeline, a playhead and a scrub image window; and selecting, by the digital media receiver, a scrub image from the first or second sets of scrub images for presentation in the scrub image window, the selecting based on a playhead position on the timeline.
 2. The method of claim 1, wherein the digital media is a sequence of video frames and the first and second sets of scrub images represent individual frames of the sequence of video frames.
 3. The method of claim 1, wherein the first and second sets of scrub images are received while the digital media is playing on a media presentation device coupled to the digital media receiver.
 4. The method of claim 1, wherein the timeline overlay is animated to move the scrub image window with the playhead in response to scrubbing.
 5. The method of claim 4, wherein selecting a scrub image from the first or second sets of scrub images for presentation in the scrub image window further comprises: selecting, by the digital media receiver, a scrub image having a position on the timeline that is closest to the playhead position on the timeline.
 6. A method comprising: receiving, by a digital media receiver, a playlist including resource identifiers to scrub images associated with a digital media; sending, by the digital media receiver, a first request to a network media server for a first set of scrub images using a first set of resource identifiers; receiving, by the digital media receiver at a first time prior to receiving a scrub command, the first set of scrub images associated with digital media, the first set of scrub images having a first set of positions on a timeline of the digital media; receiving, by the digital media receiver, a scrub command; sending, by the digital media receiver, a second request to the network media server for a second set of scrub images using a second set of resource identifiers; receiving, by the digital media receiver, the second set of scrub images associated with the digital media, the second set of scrub images having a second set of positions on the timeline that fill time gaps in the first set of positions on the timeline; responsive to the scrub command: animating, by the digital media receiver, a timeline overlay including the timeline, a playhead and a scrub image window; and selecting, by the digital media receiver, a scrub image from the first or second sets of scrub images for presentation in the scrub image window, the selecting based on a playhead position on the timeline.
 7. The method of claim 6, wherein the digital media is a sequence of video frames and the first and second sets of scrub images represent individual frames of the sequence of video frames.
 8. The method of claim 6, wherein the first and second sets of scrub images are received while the digital media is playing on a media presentation device coupled to the digital media receiver.
 9. The method of claim 6, wherein the timeline overlay is animated to move the scrub image window with the playhead in response to scrubbing.
 10. The method of claim 9, wherein selecting a scrub image from the first or second sets of scrub images for presentation in the scrub image window further comprises: selecting, by the digital media receiver, a scrub image having a position on the timeline that is closest to a playhead position on the timeline.
 11. A digital media receiver comprising: a first interface configured to couple to a network; a second interface configured to couple to a remote control device; one or more processors; memory coupled to the one or more processors and storing instructions, which, when executed by the one or more processors, causes the one or more processors to perform operations comprising: receiving, by the digital media receiver at a first time prior to receiving a first scrub command, a first set of scrub images associated with digital media, the first set of scrub images having a first set of positions on a timeline of the digital media; receiving, by the digital media receiver, the scrub command; receiving, by the digital media receiver at a second time after the first time, a second set of scrub images associated with the digital media, the second set of scrub images having a second set of positions on the timeline that fill time gaps in the first set of positions on the timeline; responsive to the first scrub command: animating, by the digital media receiver, a timeline overlay including the timeline, a playhead and a scrub image window; and selecting, by the digital media receiver, a scrub image from the first or second sets of scrub images for presentation in the scrub image window, the selecting based on a playhead position on the timeline.
 12. The digital media receiver of claim 11, wherein the digital media is a sequence of video frames and the first and second sets of scrub images represent individual frames of the sequence of video frames.
 13. The digital media receiver of claim 11, wherein the first and second sets of scrub images are received while the digital media is playing on a media presentation device coupled to the digital media receiver.
 14. The digital media receiver of claim 11, wherein the timeline overlay is animated to move the scrub image window with the playhead in response to scrubbing.
 15. The digital media receiver of claim 14, wherein selecting a scrub image from the first or second sets of scrub images for presentation in the scrub image window further comprises: selecting a scrub image having a position on the timeline that is closest to the playhead position on the timeline.
 16. A digital media receiver comprising: a first interface configured to couple to a network; a second interface configured to couple to a remote control device; one or more processors; memory coupled to the one or more processors and storing instructions, which, when executed by the one or more processors, causes the one or more processors to perform operations comprising: receiving, by a digital media receiver, a playlist including resource identifiers to scrub images associated with a digital media; sending, by the digital media receiver, a first request to a network media server for a first set of scrub images using a first set of resource identifiers; receiving, by the digital media receiver at a first time prior to receiving a scrub command, the first set of scrub images associated with digital media, the first set of scrub images having a first set of positions on a timeline of the digital media; receiving, by the digital media receiver, a scrub command; sending, by the digital media receiver, a second request to the network media server for a second set of scrub images using a second set of resource identifiers; receiving, by the digital media receiver, the second set of scrub images associated with the digital media, the second set of scrub images having a second set of positions on the timeline that fill time gaps in the first set of positions on the timeline; responsive to the scrub command: animating, by the digital media receiver, a timeline overlay including the timeline, a playhead and a scrub image window; and selecting, by the digital media receiver, a scrub image from the first or second sets of scrub images for presentation in the scrub image window, the selecting based on a playhead position on the timeline.
 17. The digital media receiver of claim 16, wherein the digital media is a sequence of video frames and the first and second sets of scrub images represent individual frames of the sequence of video frames.
 18. The digital media receiver of claim 16, wherein the first and second sets of scrub images are received while the digital media is playing on a media presentation device coupled to the digital media receiver.
 19. The digital media receiver of claim 16, wherein the timeline overlay is animated to move the scrub image window with the playhead in response to scrubbing.
 20. The digital media receiver of claim 19, wherein selecting a scrub image from the first or second sets of scrub images for presentation in the scrub image window further comprises: selecting a scrub image having a position on the timeline that is closest to the playhead position on the timeline. 